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Introduction 

Recent  innovations  in  microprocessor  technology  have  resulted  in  a  variety  of  architec¬ 
tures  for  parallel  computers.  The  two  most  commonly  available  and  used  are  the  single 
instruction,  multiple  data  (SIMD)  architecture,  in  which  an  instruction  is  carried  out  simul¬ 
taneously  on  multiple  sets  of  operands,  and  the  multiple  instruction,  multiple  data  (MIMD) 
architecture,  where  several  different  instructions  may  be  concurrently  operating  on  several 
distinct  sets  of  operands.  Examples  of  SIMD  machines  are  the  Connection  Machines  CM-1, 
CM-2  and  CM-200  from  Thinking  Machines,  Inc.,  and  the  MasPar-1  from  MasPar.  Exam¬ 
ples  of  MIMD  machines  are  the  Intel  hypercubes  and  Meshes  (Gamma,  Delta  prototypes, 
Paragon),  as  well  as  the  nCUBE  and  Parsytech  hypercubes.  The  MIMD  multiprocessor  dis¬ 
tributed  memory  architecture  has  seen  the  largest  growth  in  the  recent  years  and  has  made 
significant  advances  in  user-fiiendliness. 

For  efficient  use  of  any  parallel  computer,  first  an  appropriate  architecture  has  to  be 
chosen  for  a  given  algorithm,  or  the  algorithm  has  to  be  teiilored  to  suit  the  given  architecture. 
Secondly,  the  problem  to  be  solved  is  split  into  several  pieces.  Each  piece  is  then  handed 
over  to  a  processor  with  interprocessor  information  transfer  provided.  Finally,  all  pieces  are 
assembled.  All  these  steps  should  be  done  without  incurring  substantial  overheeid  both  in 
terms  of  reduced  computational  efficiency  and  increeised  communication  costs. 

Typical  problems  of  interest  are  unsteady  aerodynamic  flows  and  turbulent  separating 
flows  over  complex  vehicles  such  as  submarines.  The  end  users  of  codes  developed  for  these 
problems  are  the  aerodynamic  and  hydrodynamic  vehicle  designers.  To  be  successful  in  a 
time-constrained  design  environment  a  code  must  be  capable  of  rapid  mesh  generation  and 
flow  computation  so  that  an  adequate  range  of  alternative  configurations  can  be  studied  in 
a  timely  manner. 

For  the  simulation  of  flows  about  complex  geometries  such  as  the  fully  appended  sub¬ 
marine,  an  unstructured  grid  approach  offers  the  greatest  flexibility  with  the  fewest  degrees 
of  freedom.  Furthermore,  the  method  allows  straightforward  adaptive  meshing  strategies  for 
dynamic  resolution  in  transient  problems.  This  paper  describes  the  parallel  implementation 
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of  FEFLOIC,  which  is  an  implicit  finite-element  incompressible  transient  Navier-Stokes  code 
for  the  simulation  of  2-D  flows.  The  important  aspects  of  this  algorithm  are  the  unstructured 
mesh  generation  and  the  implicit  flow  solver.  These  aspects  dictate  which  type  of  parallel 
computer  is  best  suitable.  For  example,  the  mesh  generation  using  the  advancing  front  al¬ 
gorithm  [I]  will  require  different  tasks  to  be  performed  in  different  subdomains.  This  will 
necessitate  the  use  of  a  MIMD  architecture.  The  implicit  flow  solver  employs  linelets  as  a 
preconditioner  to  achieve  improved  convergence  rates  [2,3).  The  use  of  a  SIMD  architecture 
would  imply  that  either  linelets  have  to  be  used  in  all  subdomains  or  the  linelets  are  not  used 
anywhere.  The  latter  case  implies  that  the  flow  solution  is  advanced  in  an  explicit  manner, 
resulting  in  a  reduced  computational  efficiency.  Therefore,  in  order  to  retain  an  implicit 
solver  in  regions  such  as  the  boundary  layer  and  the  capability  to  explicitly  advance  the  flow 
solution  in  regions  where  the  cell  sizes  eire  large,  one  has  to  resort  to  a  MIMD  £Lrchitecture. 
Another  disadvantage  of  the  SIMD  type  of  computer  is  that  the  communication  is  inefficient 
if  it  is  not  between  nearest  neighbours.  In  an  unstructured  mesh,  the  communication  pattern 
is  quite  irregular.  However,  mapping  techniques  [4]  can  be  applied  to  assign  vertices  of  the 
mesh  to  processors  of  the  computer,  thus  minimizing  the  communication  cost.  This  mapping 
is  quite  costly  and  is  not  pragmatic  in  situations  where  adaptive  remeshing  is  essential,  for 
example,  tracking  of  vortices  in  the  wake. 

The  present  research  effort  is  directed  towards  the  parallel  simulation  of  transient  flows 
with  high  Reynolds  numbers.  In  particular,  the  goal  is  to  simulate  transient  separating  flow 
about  a  fully  appended  submarine  and  a  self-consistent  maneuvering  trajectory.  Accurate 
representation  of  the  very  near-wall  behaviour  of  the  velocity  field  will  be  critical  in  realis¬ 
tically  simulating  the  unsteeidy  three  dimensional  separation  which  exists  on  maneuvering 
submarines.  This  will  necessitate  the  use  of  increasingly  fine  elements  (y+  <  10)  toward  the 
wall.  This  precludes  the  use  of  an  explicit  solver  due  to  the  prohibitively  small  timesteps 
associated  with  the  elements  in  the  near  wall  region.  Hence  in  the  present  formulation,  the 
pressure  as  well  as  the  advection-diffusion  terms  of  the  Navier-Stokes  equations  are  treated 
in  an  implicit  manner.  The  elliptic  equations  for  pressure  and  the  velocities  are  solved  us¬ 
ing  a  preconditioned  conjugate  gradient  algorithm.  Details  of  the  flow  solver  are  given  by 
Ldhner  and  Martin  [2,3].  This  algorithm  has  been  successfully  evaluated  by  Ramamurti  and 
Lohner  [5]  for  several  flow  problems,  such  as  the  flow  over  a  backward  facing  step,  the  flow 
past  a  circular  cylinder  and  imsteady  flow  over  airfoils. 

For  the  above  algorithm  to  be  parallelizable,  it  should  be  broken  into  several  smaller 
components  and  executed  without  incurring  substantial  overhead.  Load  balancing  and  com¬ 
munication  are,  therefore,  important  issues  to  be  addressed  in  order  to  reduce  this  overhead. 
The  primary  objective  in  load  balancing  is  to  divide  the  workload  equally  among  all  the 
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processors.  For  field  solvers  based  on  grids,  the  computational  work  is  proportional  to 
where  N  is  the  number  of  elements.  The  exponent  p  =  1  for  explicit  time-marching  or 
iterative  schemes;  p  >  1,  for  implicit  time-marching  schemes  or  direct  solvers.  This  implies 
that  optimal  load  balancing  is  achieved  if  each  processor  is  allocated  the  same  number  of 
elements.  Besides  the  CPU  time  required  in  each  processor  to  advance  the  solution  one 
time-step,  communication  overhead  between  the  processors  should  be  taken  into  account. 
This  overhead  is  directly  proportional  to  the  amount  of  information  that  needs  to  be  trans¬ 
ferred  between  processors.  For  field  solvers  based  on  grids,  the  amount  of  information  that 
needs  to  be  transferred  is  proportional  to  the  number  of  surface  faces  in  each  of  the  subdo¬ 
mains.  Therefore,  an  optimal  load  and  communication  balance  is  achieved  by  allocating  to 
each  processor  the  same  number  of  elements  while  minimizing  the  number  of  surface  faces 
in  each  subdomain.  The  load  balancing  algorithm  used  in  the  current  work  is  described  by 
Lohner  et.  ai.  [6].  This  algorithm  is  based  on  the  concept  that  elements  are  exchanged  be¬ 
tween  the  subdomains  along  the  interfaces.  The  final  subdivision  having  balanced  workload 
is  achieved  through  an  iterative  process.  A  heuristic  method  is  employed  to  minimize  the 
svuface  to  volume  ratio,  i.e.,  the  commimication  to  computational  load  ratio.  Furthermore, 
the  decomposition  insists  on  simply  connected  subdomains,  again  to  reduce  communication 
costs.  Additional  renumbering  of  the  subdomains  is  done  in  an  effort  to  minimize  the  number 
of  communication  hops  between  subdomains,  a  step  important  for  a  mesh  type  architecture 
such  as  the  Intel  Delta  prototype. 

In  this  paper,  a  description  of  the  implicit  flow  solver  and  the  load  balancing  algorithm 
are  given.  An  important  phase  in  the  the  flow  solver  is  the  solution  of  the  elliptic  equations 
for  the  velocities  and  pressure.  A  preconditioned  conjugate  gradient  algorithm  is  employed 
to  solve  these  equations.  As  a  first  step  of  the  parallelization  of  this  elliptic  solver,  a  model 
problem  of  solving  the  heat  conduction  equation  is  considered.  Results  are  obtained  for  the 
problem  of  a  circular  cylinder  in  a  heated  bath.  Next,  an  explicit  version  of  the  incompressible 
flow  solver  is  parallelized.  This  is  applied  for  solving  steady  flow  over  a  NACA0012  airfoil 
at  zero  angle  of  attack.  For  time  accurate  simulations,  this  explicit  algorithm  is  not  suitable 
since,  as  discussed  earlier,  it  would  require  prohibitively  small  timesteps.  Hence,  the  implicit 
flow  solver  with  linelets  as  the  preconditioner  for  solving  the  elliptic  equations  is  parallelized. 
With  some  initial  studies,  it  was  quite  apparent  that  the  linelets  should  not  be  split  between 
subdomains.  Hence,  the  partitioning  algorithm  was  modified  to  accomodate  a  set  of  complete 
linelets  within  a  subdomain.  The  algorithm  was  then  tested  via  simulation  of  unsteady  flow 
past  NACA0012  airfoil  at  q  =  20°.  Scalability  and  performance  studies  were  done  for  all  of 
the  above-mentioned  algorithms. 
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The  Incompressible  Flow  Solver 

The  incompressible  Navier-Stokes  equations  in  the  arbitrary  Lagrangian  Eulerian  (ALE) 
form  can  be  written  as 

d\ 

'^+Va- Vv  + Vp=  V-a  ,  (la) 

V  •  V  =  0  ,  (16) 

where 

Va  =  V-W.  (Ic) 

Here  p  and  a  denote  the  pressure  and  the  stress  tensor  scaled  by  density,  v  the  flow  velocity 
and  w  the  mesh  velocity. 

The  governing  equations  (la-b)  are  first  discretized  in  time  and  the  resulting  equations 
are  linearized.  This  is  followed  by  the  discretization  in  space.  Linear  elements  with  equal 
order  interpolation  for  velocities  and  pressures  are  used.  The  resulting  matrix  system  is 
solved  sequentially  using  a  projection-like  method  (7,8).  Thus,  the  velocities  are  updated  first. 
This  is  followed  by  the  solution  of  a  Poisson-type  equation  for  the  pressures.  The  role  of  this 
step  is  to  project  the  predicted  velocity  field  into  a  divergence-free  field.  Prom  the  pressures, 
a  new  velocity  field  is  obtained.  This  two-step  process  can  be  repeated  several  times  until 
convergence  is  obtained.  However,  it  is  seldom  applied  in  practice:  our  experience  indicates 
that  one  pass  is  sufficient  for  accurate  solutions.  The  large  linear  systems  of  equations  that 
appear  during  both  the  steps  are  solved  iteratively  with  a  preconditioned  conjugate  gradient 
algorithm.  As  a  preconditioner,  linelets  are  employed.  More  details  of  the  discretization,  the 
construction  of  the  linelets  and  the  solution  of  the  Poisson  equation  are  given  in  [2]. 

Domain  Decomposition 

A  simple  algorithm  based  on  the  ‘wavefront ’-type  scheme  is  used  to  obtain  a  subdivision 
into  NDOMN  subdomains.  Briefly,  this  algorithm  can  be  described  as  follows.  We  assume 
that  the  work  required  by  each  element,  a  desired  work  per  domain,  and  the  elements  that 
surround  each  point  are  given.  To  start  the  process,  any  given  point  can  be  selected.  All 
the  elements  surrounding  this  point  are  then  marked  as  belonging  to  the  present  subdomain. 
The  cumulative  work  for  the  elements  of  this  domain  is  updated.  The  points  of  each  of 
these  elements  which  are  unsurrounded  yet,  are  stored  in  a  list  LHELP.  If  the  desired  work 
per  domain  has  been  exceeded,  a  new  domain  is  started.  Otherwise,  the  next  point  to  be 
surrounded  is  selected  from  the  order  of  creation  list  LHELP.  The  procedure  continues  until  all 
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elements  have  been  allocated.  More  details  of  this  domain  splitting  algorithm  are  described 
in  [9). 

The  load  balancing  algorithm  assumes  that  an  initial  subdivision  with  NDOMN  subdomains 
is  supplied  through  the  domain  splitting  algorithm,  or  a  parallel  grid  generator  [9],  or  from 
a  previous  timestep  of  a  flow  computation.  Given  the  measure  of  work  required  by  each  ele¬ 
ment.  the  workload  in  each  subdomain  is  summed  up.  The  surplus/deflcit  of  this  workload 
with  the  desired  average  workload  is  computed  for  each  element.  A  deflcit  difference  func¬ 
tion  is  computed  in  each  element  that  reflects  the  imbalance  in  surplus/deficit  between  the 
element  and  its  neighbours.  Elements  are  then  added  to  subdomains  with  negative  deflcit 
function.  This  process  is  repeated  until  a  balanced  subdivision  is  obtained.  The  choice  of 
the  deficit  difference  function  and  details  of  the  algorithm  are  described  in  [6]. 

Parallel  Implementation 

Having  partitioned  the  domain  into  several  subdomains,  each  subdomain  is  assigned  to 
a  processor.  The  subdomains  formed  are  such  that  there  is  one  layer  of  elements  overlapped 
between  adjacent  subdomains.  The  information  at  the  vertices  associated  with  this  layer  of 
elements  are  communicated  between  processors. 

An  important  phase  in  the  flow  solver  algorithm  is  the  solution  of  the  elliptic  equations 
using  the  preconditioned  conjugate  gradient  algorithm  (PCG).  This  algorithm  has  been 
described  by  Saad  [10]. 

Given  an  initial  guess  xq  to  the  solution  of  the  linear  system  Ax  =  6,  the  PCG  algorithm 
is  as  follows. 

1.  Compute  the  preconditioner  based  on  the  linelets  M. 

2.  Start  the  process  by  setting  ro  =  6  -  Axo,  po  =  zo  =  M~^ro. 

3.  Iterate  until  convergence 

(a)  w  =  Api 

(b)  Oi  =  iri,Zi)/iw,pi) 

(c)  Xi+i  =  Xj  -i-  OiPi 

(d)  Tj+i  =  Ti  -  OiW 

(e)  Zi+i  = 

(f)  Pi  =  (ri+i,Xi+i)/(ri,2i) 

(g)  Pi+l  =  Zi+\  +  PiPi 

In  the  above  algorithm,  the  dot  products  in  step  36  and  3/  are  potential  bottlenecks  on 
many  parallel  or  vector  m^lchines.  This  is  because  when  all  the  vectors  in  the  algorithm  can  be 
split  equally  among  the  processors,  dot  products  require  global  communication.  Therefore, 
the  algorithm  is  synchronized  at  these  two  steps  and  a  global  sum  of  the  scalars  a  and  P 
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are  obtained.  This  is  important  for  the  stability  of  the  PCG  algorithm.  After  step  3g, 
the  relevant  components  of  the  vectors  p  and  x  are  exchar  between  the  proc-essors.  To 
facilitate  the  transfer  of  information  betweei.  .he  various  processors,  a  list  of  points  that  are 
to  be  sent  by  a  processor  and  those  to  be  received  by  it  are  maintained.  These  lists  consist 
of  the  processor  to  w.  ch  the  information  has  to  be  sent  or  received  from  and  the  local 
pointers.  The  information  that  is  sent  is  then  packed  into  a  common  block  and  communicated 
synchronously.  On  an  Intel  iPSC  860,  this  is  done  using  CSEND  and  CRECV  across  processors. 

The  commimi cation  overhead  can  be  reduced  by  performing  computations  in  regions 
which  do  not  depend  on  the  interface  boundaries.  Then  conununication  can  be  done  in 
an  asynchronous  manner  by  posting  all  the  IRECVs  before  the  CSENDs  on  an  iPSC  860. 
Fyfe  [11]  has  compared  the  communication  cost  beteween  the  asynchronous  and  synchronous 
communications  in  the  parallel  version  of  the  non-orthogonal  Flux  Corrected  Transport 
algorithm  based  on  structured  grids.  The  gain  through  asynchronous  communication  wais 
about  5%  of  the  total  time. 

In  order  to  take  advantage  of  this  asynchronous  communication,  first  the  interior  points 
of  a  subdomain  should  be  handled  separately  from  the  points  along  the  communication  in¬ 
terfaces.  In  an  unstructured  grid,  this  would  result  in  an  additional  computational  overhead. 
Secondly,  the  dot  products  in  steps  3b  and  3/  of  the  PCG  algoritm  have  to  be  completed 
before  anything  else  can  be  done  and  no  other  computation  can  be  performed  while  they  are 
being  computed.  Saad  [10]  has  outlined  a  few  ways  of  overcoming  this  difficulty  but  warns 
that  they  could  lead  to  an  unstable  algorithm.  Hence,  in  the  present  effort,  synchronous 
communication  was  deemed  the  most  appropriate  for  this  algorithm. 

Results  and  Discussion 

Model  Problem 

The  parallel  implementation  of  the  PCG  algorithm  was  first  tested  via  application  to  a 
model  problem  of  solving  the  heat  conduction  equation; 

pcp^  =  V  kVT  +  S  ,  (2a) 

T  =  To{t)  on  Ff)  ,  n-A:Vr=qo(0  on  F^  .  (26) 

Here  p,Cp,T,k,S  denote  the  density,  specific  heat  at  constant  pressure,  temperature,  con¬ 
ductivity,  sources  respectively.  The  boundary  conditions  are  prescribed  temperature  To  on 
Dirichlet  boundaries  F/j  and  prescribed  heat  flux  go  on  Neumann  boundaries  F^.  As  with  the 
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incompressible  flow  solver,  linear  finite  elements  are  used,  and  the  discretization  is  obtained 
from  the  standard  Galerkin  weighted  residual  method. 

The  problem  considered  for  this  purpose,  is  to  obtain  a  steady  state  temperature  distri¬ 
bution  around  a  cylinder  immersed  in  a  hot  bath.  In  order  to  obtain  a  steady  state  solution, 
the  explicit  version  of  the  elliptic  solver  with  diagonal  preconditioning  (no  linelets)  is  em¬ 
ployed.  For  scalability  studies  of  this  algorithm,  a  coarse  grid  consisting  of  1,526  points  and 
a  fine  grid  consisting  of  9,767  points  were  chosen.  The  performance  of  the  explicit  solver 
in  terms  of  both  the  speed-up  (S)  and  the  CPU  time  taken  per  point  per  timestep  (t)  are 
shown  in  figures  la  and  lb  respectively.  These  results  are  also  summarized  in  Table  1. 
The  performance  of  the  algorithm  both  with  and  without  I/O  operations  from  the  disk  are 
shown.  For  a  computationally  bound  problem,  such  as  the  simulation  of  a  transient  flow, 
the  performance  obtained  by  not  considering  the  I/O  will  be  a  true  representative  value  of 
the  overdl  performance  of  the  solver.  From  Fig.  la,  it  Cfin  be  seen  that  the  S  obtained  using 
the  coarse  grid  is  close  to  the  ideal  linear  curve  for  4  processors  and  drops  thereafter.  Also, 
it  is  clear  that  the  performance  in  terms  of  S  for  16  processors  is  decreased  from  8.0  to  6.2 
when  I/O  time  is  considered.  This  is  because  of  the  fact  that  the  number  of  points  in  each 
subdomain  is  only  around  100,  and  therefore,  the  I/O  time  is  an  overwhelming  part  of  the 
total  time.  The  I/O  overhead  can  be  reduced  if  one  were  to  use  unformatted  read/write 
operations.  As  a  finer  grid  is  used,  the  algorithm  scales  very  well  and  the  S  is  improved 
considerably,  achieving  a  value  of  almost  12.7  for  16  processors,  compared  to  a  value  of  8.0 
obtained  using  the  coarse  grid.  Even  for  the  finer  grid  employed,  the  number  of  points  per 
subdomain  with  16  processors  is  only  about  600,  out  of  which  almost  10%  of  these  points 
are  located  along  the  communication  interfaces.  Hence,  the  deviation  of  the  performance 
of  the  solver  without  I/O  when  16  processors  are  employed,  is  largely  due  to  the  increased 
communication/computation  ratio.  All  of  the  performance  results  obtained  in  the  present 
study  are  obtained  using  64-bit  floating-point  arithmetic.  This  is  important  because  the 
64-bit  arithmetic  is  the  appropriate  one  for  this  algorithm,  and  performance  results  based 
on  32-bit  arithmetic  could  be  misleading  [12j  if  one  were  to  compare  it  to  other  systems  such 
as  the  CRAY. 

The  finer  grid  that  was  employed  in  this  study  and  the  temperature  distribution  after 
500  timesteps  using  8  processors  are  shown  in  Figs.  2a  and  2b  respectively.  It  can  be  seen 
from  Fig.  2b  that  the  solution  has  achieved  steady  state.  This  solution  was  identical  to 
the  one  obtained  using  a  single  domain.  This  provides  reasonable  confidence  in  the  parallel 
implementation  of  the  elliptic  solver. 
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Explicit  Incompressible  Flow  Solver 

The  parallel  elliptic  sol  ■.  -  was  incorporated  into  the  incompressible  flow  solver.  Results 
were  obtained  for  steady  e  v  past  a  NACA0012  airfoil  at  a  =  0°  and  a  Reynolds  number 
Re  =  300.  For  this  vi^^  jus  flow,  the  grid  that  was  employed  is  semi-structured  and  consists 
of  2,992  points  and  5,860  triangular  elements.  This  coarse  grid  was  chosen  so  as  to  fit  into  one 
8K  process jr  on  the  Intel  iPSC  860  (Garruna)  available  at  the  Naval  Research  Laboratory. 
The  ^CG  algorithm  uses  diagonal  preconditioning,  which  is  equivalent  to  an  explicit  time- 
marching  scheme.  In  order  to  achieve  steady  state,  the  maximum  allowable  timestep  in  each 
element  is  employed. 

The  performance  results  of  this  explicit  incompressible  flow  solver  are  shown  in  Fig.  3 
^lnd  are  also  summarized  in  Table  2.  The  effect  of  compiler  optimization  on  the  performance 
is  also  shown.  A  speed-up  of  10.18  is  obtained  using  16  processors.  Prom  these  results  it 
is  clear  that  the  algorithm  is  scalable.  The  use  of  compiler  optimization  is  to  reduce  the  S 
obtained  with  16  processors  to  8.98.  This  reduction  is  due  to  the  fact  that  with  compiler 
optimization,  the  computational  time  per  processor  is  decreased  while  the  communication 
overhead  remains  a  constant.  It  is  clear  from  Table  2  that  for  16  processors,  the  CPU  time 
per  point  per  timestep  is  reduced  by  about  20%  with  the  use  of  compiler  optimization. 

The  Laboratory  for  Computational  Physics  and  Fluid  Dynamics  at  NRL  has,  as  a  result 
of  DARPA  and  NRL  support,  a  32-node  Intel  iPSC860  Touchstone  Gamma.  DARPA  has 
also  made  computing  time  available  to  us  on  the  512  node  Intel  Delta  prototype  at  CalTech. 
Scalability  studies  for  2,4,8,16  and  32  processors  were  carried  out  on  the  NRL  Gamma 
prototype  and  a  64  processor  computation  was  carried  out  on  the  Delta.  For  this  pupose 
a  finer  grid  consisting  of  approximately  twice  the  number  of  points  of  the  coarse  grid  was 
employed.  The  performance  results  on  the  Delta  and  the  comparison  with  the  Gamma  are 
shown  in  Fig.  4.  It  is  clear  that  the  code  is  scalable.  It  can  be  seen  that  as  the  number  of 
processors  is  increased  for  a  constant  problem  size,  the  S  deviates  from  the  ideal  linear  curve. 
This  is  a  result  of  the  increased  communication/computation  ratio.  There  is  considerable 
improvement  in  S  as  the  problem  size  is  increased,  as  would  be  expected.  Even  for  this  finer 
mesh,  the  performance  drops  down  when  the  number  of  processors  employed  is  greater  than 
16.  This  is  again  due  to  the  fact  that  the  communication  overhead  is  overwhelming  the  total 
computational  cost.  There  is  clearly  no  benefit  to  be  obtained  by  increasing  the  number  of 
processors  further  for  this  constaint  size  problem.  Figures  5a  and  5b  show  the  coarse  grid 
and  the  steady  state  pressure  in  the  vicinity  of  the  airfoil.  The  subdomain  interfaces  are 
superimposed  on  the  pressure  contours  in  Fig.  5b. 

This  explicit  flow  solver  was  then  applied  to  solve  the  inviscid  flow  past  a  complex  landing 
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configuration  tri-eleinent  airfoil.  The  grid  that  was  employed,  the  partitiong  consisting  of 
32  subdomains  are  shown  in  Figs.  6a  and  6b  respectively.  The  grid  consists  of  8,952  points 
and  17,127  elements.  The  results  obtained  at  o  =  5“  after  500  timesteps  are  shown  in  Fig. 
7.  This  figure  shows  the  pressure,  the  contours  of  absolute  velocity  and  the  velocity  vectors 
with  the  interface  boundzuies  superposed  on  each  of  them.  Figure  7c  shows  the  presence  of 
a  recirculating  region  behind  the  main  section  of  the  airfoil.  The  total  CPU  time  taken  for 
the  500  steps  is  14  minutes  and  6  seconds  resmcing  in  r  =  0.189  mSec/point/timestep. 

Implicit  Incompressible  Flow  Solver 

For  liif  simulation  of  unsteady  separated  flows,  the  typically  fine  elements  in  the  bound¬ 
ary  layer  would  force  prohibitively  small  timesteps  with  an  explicit  solver  for  the  advection 
terms.  Hence,  the  implicit  flow  solver  is  parallelized  next.  Here,  the  linelets  are  used  as  a 
preconditioner  in  the  PCG  algorithm.  The  parallel  algorithm  w-as  applied  for  solving  the 
unsteady  flow  past  the  NACA0012  airfoil  at  a  =  20°.  The  grid  that  w'as  employea  is  the 
coarse  grid  consisting  of  2,992  points.  This  grid  was  partitioned  as  before,  and  each  sub- 
domain  handed  to  a  processor.  Linelets  were  then  constructed  in  each  of  these  subdomains 
and  the  flow  was  advanced  in  each  of  these  subdomains.  It  was  quite  apparent  that  the  PCG 
algorithm  w'as  not  stable,  and  the  maximum  residue  in  the  equation  solved  would  occur 
near  the  communication  interface  where  a  linelet  is  split.  Hence,  the  domain  partitioning 
dgorithm  was  modified  to  accomodate  a  complete  set  of  linelets  within  a  subdomain.  This 
was  done  by  supplying  to  the  domain  splitting  algorithm  the  additional  information  about 
the  points  that  would  be  a  part  of  a  linelet.  First,  a  point  along  a  linelet  is  chosen.  All  the 
elements  surrounding  this  point  are  then  marked  belonging  to  the  present  subdomain.  The 
next  point  is  selected  from  the  order  of  creation  of  this  linelet.  The  process  is  continued  until 
all  the  points  along  this  linelet  are  exhausted.  When  the  number  of  elements  belonging  to 
this  subdomain  exceeds  a  set  average  value,  the  subdomain  counter  is  updated.  In  regions 
of  flow  where  the  large  aspect  ratio  cells  are  not  required,  such  as  the  region  outside  the 
boundary  layer  and  the  wake,  the  advection  terms  can  be  advanced  in  an  explicit  manner 
and  no  linelet  preconditioning  is  necessary.  These  regions  are  subdivided  next. 

The  comparison  of  the  performance  of  the  implicit  and  the  explicit  versions  of  the  code 
is  showm  in  Fig.  8.  It  is  clear  that  there  is  very  little  difference  in  S  obtained  due  to  the 
addition  of  the  linelet  preconditioning.  One  main  reason  for  this  is  in  the  boundary  layer 
regions  only  one  set  of  linelets,  aligned  closely  along  the  normal  to  the  airfoil,  is  employed. 
This  results  in  a  tridiagond  system  to  be  inverted  at  the  preconditioning  step  (Steps  2  and 
3e)  of  the  PCG  algorithm.  The  inversion  of  this  system  is  very  efficient.  Another  reason  is 
that  the  linelets  are  confined  to  regions  of  high  aspect  ratio  cells,  which  are  typically  in  the 
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boundary  layer  and  the  wake 

Figure  9a  shows  a  grid  consisting  of  6,785  points  and  13,376  elements  that  was  employed 
for  computing  the  flow  over  NACA0012  airfoil  at  a  =  20°  at  Re  —  300.  This  flow  adapted 
grid  was  obtained  from  a  previous  run  for  the  same  flow  configuration  on  a  vector  machine. 
This  grid  was  then  subdivided  into  32  subdomains.  Fig.  9b  shows  the  domain  decomposition 
in  the  vicinity  of  the  airfoil.  Figures  9c-e  show  the  pressure,  vorticity  emd  the  velocity  vectors 
of  the  unsteady  flow  after  25  characteristic  times.  The  Strouhal  number  St  for  this  flow  is 
approximately  0.43.  The  CPU  time  taken  for  500  timesteps  is  13  minutes  and  12.9  seconds 
resulting  in  r  =  0.2337  mSec/point/timestep.  A  grid  consisting  of  10,952  points  and  21,640 
elements  was  employed  for  computing  the  flow  over  a  NACA0012  airfoil  at  Re  =  1.4  x  10® 
using  the  CRAY  YMP,  Convex  C210  and  the  Intel  iPSC  860.  This  performance  study 
showed  that  the  CPU  time  taken  per  point  per  timestep  (t)  on  the  Intel  iPSC  860  using  32 
processors  is  0.2  mSec.  compared  to  0.16  mSec.  on  one  processor  YMP.  The  corresponding 
value  of  r  for  the  Convex  is  1.46  mSec. 

Summary  and  Conclusions 

An  incompressible  flow  solver  based  on  unstructured  grids  was  successfully  implemented 
on  a  MIMD  architecture,  viz.,  Intel  iPSC  860.  The  parallel  implementation  of  the  elliptic 
solver  was  tested  using  the  heat  conduction  equation  as  a  model  problem.  The  parallelized 
elliptic  solver  was  then  incorporated  into  the  explicit  and  implicit  versions  of  the  incompress¬ 
ible  flow  solver.  Performance  studies  were  performed  on  all  of  these  algorithms.  Scalability 
studies  were  performed  on  both  the  Gamma  and  the  Delta  prototypes,  and  show  that  the 
code  is  scalable.  A  parallelizable  load  balancing  algorithm  was  developed  to  be  used  in 
conjunction  with  the  incompressible  flow  solver.  Future  developments  will  be  focused  on 
the  incorporation  of  adaptive  regridding  in  2-D  and  the  parallel  implementation  of  the  3-D 
incompressible  flow  solver. 
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Fig.  1.  Performance  of  FEHEAT 
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a.  Grid.  npoin=  9,767,  nelem=  29,236,  ndomn=  8 


b.  Temperature  Distribution,  min=  0.0,  max=  334.0,  AT=  16.7 
Fig.  2.  Model  Problem  of  Heat  Conduction  around  a  Cylinder 
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Performance  of  the  Explicit  Flow  Solver 


erformance  of  the  Explicit  Flow  Solver  on  Gamma  and  Delta 
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Fig.  5.  Flow  Past  a  NACA0012  Airfoil,  a  =  0“ 

b.  Pressure,  min.=  0.845,  max.=  1.61,  A  p=  1.92E^02 
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Fig.  6a.  Unstructured  Grid  Around  a  IVi-Element  Airfoil 


Fig.  6b.  Domain  Decomposition  of  the  Grid  around  a  IVi-Element  Airfoil 


ndomn  =  32 


Fig.  7.  Flow  Past  a  TVi-Element  Airfoil,  a  =  5® 

a.  Pressure,  min.=  -2.03,  max.=  1.48,  A  p=  8. 
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Fig.  7.  Flow  Past  a  TVi-Element  Airfoil,  a  =  5® 


b.  Absolute  Velocity, 


min.=  2.63E-03,  max.=  2.54,  A  V=  6.34E)-02 
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Fig.  7.  Flow  Past  a  TVi-Element  Airfoil,  a 
c.  Velocity  Vectors 
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Fig.  8.  Performance  of  the  Implicit  Flow  Solver 


Fig.  9.  Flow  Past  a  NACA0012  Airfoil,  a  =  20° 
a.  Grid,  npoin  =  6,785,  nelem  =  13,376 
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Fig.  9.  Flow  Past  a  NACA0012  Airfoil,  a  =  20° 
b.  Domain  Decomposition,  ndomn  =  32 
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Fig.  9.  Flow  Past  a  NACA0012  Airfoil,  q  =  20® 

c.  Pressure,  min.=  0.38,  max.=  1.55,  A  p=  2.93E-02 
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Fig.  9.  Flow  Past  a  NACA0012  Airfoil,  o  =  20° 

d.  Vorticity,  min.=  -1.44E+02,  max.=  2.87E+01,  Aw  =  2.16 
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Fig.  9.  Flow  Past  a  NACA0012  Airfoil,  a  =  20“ 
e.  Velocity  Vectors 
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